PACS 87.33.Ge - Dynamics of social networks
Abstract - We introduce a new method to efficiently approximate the number of infections resulting
from a given initially-infected node in a network of susceptible individuals. Our approach
is based on counting the number of possible infection walks of various lengths to each other node 
in the network. We analytically study the properties of our method, in particular demonstrating 
different forms for SIS and SIR disease spreading (e.g. under the SIR model our method counts 
self-avoiding walks). In comparison to existing methods to infer the spreading efficiency of differ- 
ent nodes in the network (based on degree, k-shell decomposition analysis and different centrality 
measures), our method directly considers the spreading process and, as such, is unique in providing 
estimation of actual numbers of infections. Crucially, in simulating infections on various real-world 
networks with the SIR model, we show that our walks-based method improves the inference of 
effectiveness of nodes over a wide range of infection rates compared to existing methods. We also 
analyse the trade-off between estimate accuracy and computational cost, showing that the better 
accuracy here can still be obtained at a comparable computational cost to other methods. 
Introduction. - Epidemic spreading in biological, social,
and technological networks has recently attracted
much attention (see for instance [1]-[7]). Most of these
studies focus on the following question: Assume that we 
first infect a randomly chosen individual of the network 
(patient zero) - how likely is it that a substantial part of 
the network will be infected? In these earlier approaches 
the network was considered as a whole and the role of pa- 
tient zero on the disease spreading process was neglected. 

In this letter, we consider the role a single individual 
plays in the spreading process rather than the global prop- 
erties of the network. It is of particular interest to identify 
the most influential spreaders, and to do so without ex- 
pensive simulations. This knowledge could, for instance, 
be used to prioritise vaccinations for the most influential 
spreaders. The number of neighbours of an individual is 
a simple but crude approximation for an individual's in- 
fluence, and one has to take further topological properties 
of the network into account to understand the spreading 
process adequately [8, 9]. As such, [5, 8, 9] propose different
inference measures for a node's spreading influence
such as the k-shell decomposition, the local centrality measure
or eigenvector centrality (see footnote 1). All of these approaches
show strong correlations between their measure of influ- 
ence and the (simulated) number of infected nodes. There 
is potential for improvement however in: i. considering 
network features encountered by longer infection walks, 
and ii. addressing the ultimate goal of estimating the ex- 
pected number of infections rather than merely obtaining 
correlations. Importantly, one can only predict whether 
an infection will be epidemic (i.e. a large portion of the 
network will be infected) or harmless from an estimate of 
infection numbers, not from correlation scores of an infer- 
ence method alone. In this letter, we present an approach 
based on counting the number of potential infection walks 
from a given initially infected individual. Our method 
overcomes the above issues and allows us to consistently 
estimate with very good accuracy the expected number of 
infections from each patient zero. Moreover, our method 
is very efficient and has low computational costs. 



(a) E-mail: bauer@mis.mpg.de, joseph.lizier@csiro.au



1 For directed networks other methods exist for ranking the influence
of nodes in different dynamical contexts (e.g. ranking researchers
according to influence on the scientific community [10]-[12]).
The general model. - We consider epidemic spreading
on network structures. A complex network can be
identified with a graph $\Gamma = (V, E)$ (here $V$ is the vertex
set and $E$ is the edge set; see footnote 2) in an obvious way. We say that $i$
and $j$ are neighbours, in symbols $i \sim j$, if they are connected
by an edge. In general, we deal with undirected
graphs, though our formulae are trivially extended for the
directed case. Often it is convenient to describe a graph by
its adjacency matrix $A = (a_{ij})_{i,j=1,\dots,|V|}$, where the matrix
element $a_{ij} = 1$ if $i$ and $j$ are neighbours and zero otherwise,
and $|V|$ is the number of vertices. Furthermore,
$d_i = \sum_j a_{ij}$ denotes the (out) degree of the vertex $i$.

First we consider a generalization of the SIS
(susceptible/infected/susceptible) model and the SIR
(susceptible/infected/removed) model. In our model a
disease is spread in a network through contact between
infected (ill) individuals and susceptible (healthy) individuals.
At a given time step, each infected individual
will infect each of its susceptible neighbours with a given
probability $0 < \beta < 1$ (for simplicity we assume that $\beta$ is
the same for all pairs of vertices; however, generalization
to variable $\beta_{ij}$ is straightforward). An infected individual
will be removed from the network with probability $(1 - \lambda)$
(modelling either death or full recovery with immunity);
otherwise, an infected individual remains in the network
with probability $\lambda$ and remains susceptible to (re)infection
at the (very) next time step. For $\lambda = 0$ and $\lambda = 1$ this
model reduces to the SIR and SIS model, respectively (see footnote 3).
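The update rule described above can be sketched as a small Monte Carlo routine. This is a minimal illustration rather than the authors' simulation code: the function name `simulate_infections`, the adjacency-list dictionary `adj`, and the `max_steps` cap are assumptions made here, as is the reading that a node is infectious only during the time step after it was infected (which is what makes "infected at time k" equivalent to "infected through a walk of length k").

```python
import random

def simulate_infections(adj, patient_zero, beta, lam, max_steps=50, rng=None):
    """Run one realisation and return the total number of infection events.

    adj: dict mapping each vertex to a list of its neighbours (assumed format).
    beta: per-contact infection probability.
    lam: probability that an infected node stays in the network
         (lam = 0 gives SIR, lam = 1 gives SIS in the generalised model).
    """
    rng = rng or random.Random()
    removed = set()
    infected = {patient_zero}
    total = 0
    for _ in range(max_steps):
        if not infected:
            break
        # Each currently infected node is removed with probability 1 - lam,
        # so it cannot be reinfected at the very next time step.
        for u in infected:
            if rng.random() >= lam:
                removed.add(u)
        # Every currently infected node tries to infect each non-removed neighbour.
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                if v not in removed and rng.random() < beta:
                    newly_infected.add(v)
        total += len(newly_infected)
        infected = newly_infected
    return total
```

Averaging the returned count over many runs gives a simulated infection number for a given patient zero; `max_steps` is needed because for `lam > 0` the process is not guaranteed to die out.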

For a given network, we want to find the expected num- 
ber of infections given the person that was infected first. 
It is natural to think of the spreading process in terms of 
infection walks in the corresponding graph. The degree of 
a vertex is a first indicator of how many individuals it will 
infect, however this neglects all infection walks of length 
greater than one - see also [9] where the role of longer infection
walks in epidemic spreading is discussed and numerical
simulations were performed (see footnote 4). Moreover such walks
play a very important role in the dynamics of complex 
networks. In the following we will show that it is crucial 
to also take longer infection walks into account in order to 
get precise results. 

The probability p(i,j,k) that vertex j is infected 
through a walk of length k given that the infection started 
at vertex i can be written as 



$$p(i,j,k) = p_{\mathrm{inf}}(i,j,k)\, p_{\mathrm{sus}}(i,j,k), \tag{1}$$



2 For simplicity, we do not allow self-loops or multiple edges. 

3 This is slightly different to the usual discrete-time SIS model, 
where infected individuals must return to susceptible at the next 
time step before reinfection is possible. In our interpretation, all non- 
removed nodes are susceptible (i.e. infected and susceptible are not 
mutually exclusive). This difference allows us to mathematically 
generalise and study smooth transitions from the SIR to the SIS 
model, using the walk-counting approach. 

4 This study used walk counts from a source node as a predictor 
of spreading efficiency. However, unlike the technique we present 
it did not convert those counts into a direct estimate of infection 
numbers, nor did it consider the appropriate types of walks (i.e. one 
must consider self-avoiding walks for the SIR model). 



where $p_{\mathrm{inf}}(i,j,k)$ is the probability that vertex $j$ is infected
at time $k$ given that vertex $j$ is susceptible at time $k$, and
$p_{\mathrm{sus}}(i,j,k)$ is the probability that vertex $j$ is susceptible
(i.e. not removed) at time $k$, both given that the infection
started at vertex $i$. (We refer to time here since the
spreading process is updated at discrete time steps; hence
it is equivalent to say that vertex $j$ is infected through a
walk of length $k$ or infected at time $k$.)

For general graph topologies it is difficult and expensive
to calculate the $p(i,j,k)$ exactly. In the subsequent
analysis we show how each of $p_{\mathrm{inf}}(i,j,k)$ and $p_{\mathrm{sus}}(i,j,k)$
in turn can be approximated when we make the following
reasonable simplifying assumption: we assume that all
infection walks (of the same as well as different lengths)
are independent of each other, i.e. we treat them as if
they have no edges in common (see footnote 5). As our simulation
results indicate, this is a reasonable approximation. Using
this independent walk assumption, we approximate:



$$p_{\mathrm{inf}}(i,j,k) \approx q_{\mathrm{inf}}(i,j,k) = 1 - \prod_{P_m \in \mathcal{V}(i,j,k)} \bigl(1 - p(P_m)\bigr).$$

$\mathcal{V}(i,j,k)$ is the set of all walks from $i$ to $j$ of length $k$
and $p(P_m)$ is the probability that the infection takes place
along the walk $P_m$. This formula is easily obtained by
noting that $\prod_{P_m \in \mathcal{V}(i,j,k)} (1 - p(P_m))$ is the probability that
$j$ is not infected at time $k$ given that it was susceptible
and that the infection started at vertex $i$. It is insightful
to rewrite the last equation in the following form:

$$q_{\mathrm{inf}}(i,j,k) = 1 - \prod_{l \ge 0} \bigl(1 - \lambda^{l} \beta^{k}\bigr)^{s_{ij}^{k,l}}, \tag{2}$$

where $s_{ij}^{k,l}$ is the number of walks from $i$ to $j$ of length $k$
with $l$ repeated vertices, i.e. the number of walks consisting
of $k + 1 - l$ different vertices (including $i$ and $j$).
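To make equation (2) concrete, the following sketch evaluates $q_{\mathrm{inf}}$ for one pair $(i,j)$ and one walk length $k$ once the walk counts are known. The function name `q_inf` and the dictionary `walk_counts` (mapping the number of repeated vertices $l$ to the count $s_{ij}^{k,l}$) are illustrative assumptions, not part of the original text.

```python
def q_inf(beta, lam, k, walk_counts):
    """Evaluate equation (2) for a fixed pair (i, j) and walk length k.

    walk_counts: dict mapping l (number of repeated vertices) to s_ij^{k,l},
                 the number of walks of length k with l repeated vertices.
    """
    prob_not_infected = 1.0
    for l, count in walk_counts.items():
        # A walk with l repeated vertices succeeds with probability lam**l * beta**k.
        # Python evaluates 0.0 ** 0 as 1.0, matching the convention 0^0 = 1.
        prob_not_infected *= (1.0 - (lam ** l) * (beta ** k)) ** count
    return 1.0 - prob_not_infected
```

With `lam = 0.0` only the self-avoiding walks (`l = 0`) contribute, while with `lam = 1.0` every walk contributes equally, which are the two special cases the text goes on to discuss.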

Let us have a closer look at the relationship between
$p_{\mathrm{inf}}(i,j,k)$ and $q_{\mathrm{inf}}(i,j,k)$. To properly compute
$p_{\mathrm{inf}}(i,j,k)$, one must compute infection probabilities
on each walk $P_m$ in some order, and properly condition
these infection probabilities on those of overlapping
previously considered walks in $\{P_1, P_2, \dots, P_{m-1}\}$.
This leads to properly conditioned infection probabilities
$p(P_m \mid P_1, P_2, \dots, P_{m-1})$ and the expression:

$$p_{\mathrm{inf}}(i,j,k) = 1 - \prod_{P_m \in \mathcal{V}(i,j,k)} \bigl(1 - p(P_m \mid P_1, P_2, \dots, P_{m-1})\bigr).$$

Now, if infection has not already occurred on one of
these previously considered walks $\{P_1, P_2, \dots, P_{m-1}\}$,
then $p_{\mathrm{inf}}(i,j,k)$ only differs from $q_{\mathrm{inf}}(i,j,k)$ where $P_m$
has any overlapping edges with $\{P_1, P_2, \dots, P_{m-1}\}$. Since
infection did not occur on any of these walks with
shared edges, some of the shared edges for the
$P_m$ may in fact already be closed (i.e. dropping
$p(P_m \mid P_1, P_2, \dots, P_{m-1})$ below $p(P_m)$). This yields:

$$p_{\mathrm{inf}}(i,j,k) \le q_{\mathrm{inf}}(i,j,k) \quad \text{for all } k, \tag{3}$$

i.e. the independent walk assumption leads to an overestimation
of $p_{\mathrm{inf}}(i,j,k)$.

5 For walk lengths less than or equal to two this assumption is not
needed since all walks are independent anyway.
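As a small worked illustration of inequality (3), consider a hypothetical five-vertex graph (chosen here for illustration, not taken from the paper) with edges $i - a$, $a - b$, $a - c$, $b - j$ and $c - j$, under the SIR model. The only walks of length 3 from $i$ to $j$ are $i, a, b, j$ and $i, a, c, j$, which share the edge $(i, a)$. Treating the shared edge once gives the exact value, while the independent walk assumption treats it twice:

```latex
\begin{align*}
p_{\mathrm{inf}}(i,j,3) &= \beta \left[ 1 - (1 - \beta^{2})^{2} \right] = 2\beta^{3} - \beta^{5}, \\
q_{\mathrm{inf}}(i,j,3) &= 1 - (1 - \beta^{3})^{2} = 2\beta^{3} - \beta^{6}, \\
q_{\mathrm{inf}}(i,j,3) - p_{\mathrm{inf}}(i,j,3) &= \beta^{5} (1 - \beta) \;\ge\; 0 .
\end{align*}
```

The gap is of higher order in $\beta$ than the estimate itself, so in this example the overestimation is small for small infection rates.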

Let us now study $p_{\mathrm{sus}}(i,j,k)$ in more detail. We introduce
$p_{\mathrm{rem}}(i,j,k) := 1 - p_{\mathrm{sus}}(i,j,k)$ (see footnote 3), i.e.
the probability that vertex $j$ is removed at time $k$ given
the infection started at vertex $i$. For all $t \ge 1$, we have:

$$p_{\mathrm{rem}}(i,j,t+1) = p_{\mathrm{rem}}(i,j,t) + p_{\mathrm{sus}}(i,j,t)\, p_{\mathrm{inf}}(i,j,t)\,(1 - \lambda)$$
$$\Longrightarrow\; p_{\mathrm{sus}}(i,j,t+1) - p_{\mathrm{sus}}(i,j,t) = -(1 - \lambda)\, p(i,j,t). \tag{4}$$

Summing (4) over all $t$ from 1 to $k - 1$ we obtain:

$$p_{\mathrm{sus}}(i,j,k) = 1 - (1 - \lambda) \sum_{t=1}^{k-1} p(i,j,t), \tag{5}$$

where we used $p_{\mathrm{sus}}(i,j,1) = 1$. As before, we use the
independent walk assumption to approximate $p_{\mathrm{sus}}(i,j,k)$ by
$q_{\mathrm{sus}}(i,j,k)$, and also $p(i,j,k)$ by $q(i,j,k)$ where we define:

$$q(i,j,k) := q_{\mathrm{inf}}(i,j,k)\, q_{\mathrm{sus}}(i,j,k). \tag{6}$$

So by analogy to (5) we define:

$$q_{\mathrm{sus}}(i,j,k) := 1 - (1 - \lambda) \sum_{t=1}^{k-1} q(i,j,t), \quad \forall k. \tag{7}$$

We observe that $q_{\mathrm{sus}}$ satisfies an equation similar to (4):

$$q_{\mathrm{sus}}(i,j,k) = q_{\mathrm{sus}}(i,j,k-1) - (1 - \lambda)\, q(i,j,k-1).$$
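The recursion for $q_{\mathrm{sus}}$ means that $q(i,j,k)$ can be accumulated in a single pass over increasing walk length. The sketch below assumes the values $q_{\mathrm{inf}}(i,j,1), q_{\mathrm{inf}}(i,j,2), \dots$ for one fixed pair $(i,j)$ are already available as a list; the function name `q_sequence` is hypothetical.

```python
def q_sequence(q_inf_values, lam):
    """Return [q(i,j,1), q(i,j,2), ...] for one fixed pair (i, j).

    q_inf_values: list whose entry k-1 holds q_inf(i, j, k).
    lam: probability that an infected node stays in the network.
    """
    q_values = []
    q_sus = 1.0  # q_sus(i, j, 1) = 1: nothing has been removed before time 1
    for q_inf_k in q_inf_values:
        q_k = q_inf_k * q_sus           # product q = q_inf * q_sus
        q_values.append(q_k)
        q_sus -= (1.0 - lam) * q_k      # q_sus(k+1) = q_sus(k) - (1 - lam) q(k)
    return q_values
```

For `lam = 1.0` the susceptibility factor never changes and `q` equals `q_inf`; for `lam = 0.0` each infection permanently reduces the probability that the vertex is still available.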

We then consider the connection between $p_{\mathrm{sus}}(i,j,k)$ and
$q_{\mathrm{sus}}(i,j,k)$. In order to investigate this we note that walks
of length one and two always satisfy the independence
assumption. Hence we have $p_{\mathrm{inf}}(i,j,k) = q_{\mathrm{inf}}(i,j,k)$ and
$p_{\mathrm{sus}}(i,j,k) = q_{\mathrm{sus}}(i,j,k)$ for $k = 1, 2$.

Now we will prove by induction that $p_{\mathrm{sus}}(i,j,t) \ge
q_{\mathrm{sus}}(i,j,t)$ for all $t$. First we assume this is true for $t = k$
(as demonstrated for $k = 1, 2$ above). Then considering
$t = k + 1$, combining (1) and (4) we have:

$$p_{\mathrm{sus}}(i,j,k+1) = p_{\mathrm{sus}}(i,j,k)\bigl(1 - (1 - \lambda)\, p_{\mathrm{inf}}(i,j,k)\bigr)$$
$$\ge q_{\mathrm{sus}}(i,j,k)\bigl(1 - (1 - \lambda)\, q_{\mathrm{inf}}(i,j,k)\bigr)$$
$$= q_{\mathrm{sus}}(i,j,k+1),$$

where we used (3) and our inductive assumption
$p_{\mathrm{sus}}(i,j,k) \ge q_{\mathrm{sus}}(i,j,k)$. Hence we conclude that:

$$p_{\mathrm{sus}}(i,j,k) \ge q_{\mathrm{sus}}(i,j,k) \quad \forall k. \tag{8}$$



That is, we systematically underestimate the probability
$p_{\mathrm{sus}}$ of being susceptible. Together with the observation
that we systematically overestimate the probability of being
infected in (3), these opposite effects of our independent
walks assumption may balance each other in (6).



We define the impact of vertex $i$ (the estimated number
of infections given that vertex $i$ was infected first) as

$$I_i := \lim_{L \to \infty} I_i(L) = \lim_{L \to \infty} \sum_{k=1}^{L} \sum_{j \in V} q(i,j,k).$$

$I_i$ counts the total number of infections, and so is not required
to converge if $\lambda > 0$ since then some of the vertices
might be infected several times. Alternatively some other
studies define outbreak size as the number of nodes infected
at least once, though this does not inform one as
to whether the infection will die out or not. (Note that
if the nodes cannot be infected several times, i.e. for the
SIR model, both aforementioned quantities coincide.) It
is easy to verify that in considering only walks of length 1,
the impact of vertex $i$ is $I_i(L = 1) = \beta d_i$. This shows that
the degree $d_i$ of vertex $i$ is the first order approximation of
the impact $I_i$. In order to calculate the $q(i,j,k)$ we need
to know all the $s_{ij}^{k,l}$. The calculation of $s_{ij}^{k,l}$ from $i$ to all $j$
can be completed in $O(D^k)$ steps (average case of counting
walks along homogeneous out-degree $D$ nodes with independent
edges), and is the computational-time bottleneck
for our method. (We propose later an asymptotically more
efficient calculation for the SIS case.) Crucially, while this
asymptotic scaling is the same as for simulating the disease
spreading process, the constant factor of proportionality
for our technique is smaller by several orders of magnitude (see footnote 6).

In the following, we restrict ourselves to the special cases
$\lambda = 0, 1$ where we obtain the SIR and SIS models.

The SIS-model. - The SIS model corresponds to
the case $\lambda = 1$, i.e. where infected nodes always become
susceptible again (i.e. do not die or become immune).
Examples of SIS-type disease spreading include computer
viruses and pests in agriculture where the individuals
(computers/crops) do not develop immunity against the
disease and hence can be re-infected.

To compute $q_{\mathrm{inf}}(i,j,k)$ for $k \ge 1$, one has to count the number
$s_{ij}^{k}$ of different walks of length $k$ between $i$ and $j$, i.e.
the number of possible infection walks with any number
of repeated vertices $l$. Crucially, $s_{ij}^{k}$ is given by the $ij$-th
entry of the $k$-th power $A^k$ of the adjacency matrix
$A$, which is computed in low-order polynomial time, making
our method asymptotically much more efficient than
simulations for the SIS special case. By equations (2) and (6) the
probability $p(i,j,k)$ that $j$ is infected by $i$ through a walk
of length $k$ is then approximated by (with $\lambda = 1$):



$$q(i,j,k) = 1 - \bigl(1 - \beta^{k}\bigr)^{s_{ij}^{k}}, \tag{9}$$



6 The expected number of evaluations $e$ per simulation consists
of $D$ evaluations of disease spread to each neighbour plus the same
expected number of evaluations $e$ per $D\beta$ infected neighbour (on
average, in the sub-critical regime); i.e. $e = D + D\beta e$. One can
solve $e = D/(1 - D\beta)$, but it is more useful to write this to limited
walk length $k$ as $O(D^k \beta^{k-1})$. Crucially, we require the number of
repeat simulations to be $\gg 1/\beta^{k-1}$ for proper sampling, and new
simulations are required for each $\beta$. These two requirements push the
constant factor orders of magnitude beyond that for our technique
since we only need to calculate the SAWs once for all $\beta$.

where we used $q_{\mathrm{sus}}(i,j,k) = 1$ since $\lambda = 1$ here. We obtain:

$$I_i^{\mathrm{SIS}} = \lim_{L \to \infty} \sum_{k=1}^{L} \sum_{j \in V} \Bigl(1 - \bigl(1 - \beta^{k}\bigr)^{s_{ij}^{k}}\Bigr) \tag{10}$$

(using $0^0 = 1$ by convention if we allow $\beta = 1$). Again we
point out that this expression might not converge (particularly
if $\beta$ is too large) since individuals can be infected
several times. Such divergence has a meaningful interpretation:
i.e. that the infection will remain forever in the
network and will not die out. In (10) we take arbitrarily
long walks into account since the vertices cannot develop
immunity against the disease and so no upper bound for
the maximal length of an infection walk exists.
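The truncated sum in (10) can be sketched in a few lines. This is a dependency-free illustration under stated assumptions: the graph is supplied as a dense 0/1 adjacency matrix (a list of lists), and the name `sis_impact` is hypothetical. Only row $i$ of $A^k$ is needed, so the sketch propagates a single row vector instead of forming full matrix powers.

```python
def sis_impact(adj_matrix, source, beta, max_len):
    """Estimate the SIS impact of `source`, truncating (10) at walk length max_len."""
    n = len(adj_matrix)
    # Row `source` of A^0: one walk of length 0, ending at the source itself.
    walks = [1 if c == source else 0 for c in range(n)]
    total = 0.0
    for k in range(1, max_len + 1):
        # Row `source` of A^k = (row `source` of A^(k-1)) times A.
        walks = [sum(walks[m] * adj_matrix[m][c] for m in range(n))
                 for c in range(n)]
        # Add q(i, j, k) = 1 - (1 - beta^k)^(s_ij^k) for every target j.
        total += sum(1.0 - (1.0 - beta ** k) ** s for s in walks)
    return total
```

For example, on the three-vertex path with adjacency matrix `[[0, 1, 0], [1, 0, 1], [0, 1, 0]]`, `sis_impact(A, 0, 0.5, 2)` sums 0.5 for the single walk of length 1 and 0.25 for each of the two walks of length 2.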

For other diseases it is more natural to assume that the 
vertices can develop immunity after infection. 

The SIR-model. - The SIR model corresponds to
the case $\lambda = 0$, i.e. where infected nodes never become
susceptible again after infection as the individuals develop
immunity or die. As far as infection spreading is concerned,
they are considered removed (i.e. they cannot spread the
virus, nor be reinfected). Examples of SIR-type disease
spreading include most diseases spread among humans.
Since a vertex cannot be infected twice, we have to modify
our previous considerations appropriately. Instead of
general walks we now have to consider self-avoiding walks
(SAWs) or paths. Indeed, it has been previously suggested
that an understanding of SAWs would be useful in
epidemiology [13], though this was not properly investigated.

It is well known that, compared to counting walks, it is
much more difficult to count SAWs in a graph [14]. However,
instead of explicitly calculating the number of SAWs
one can obtain the number of SAWs recursively [15]. In
particular, the number of SAWs from $i$ to $j$ of length $k + 1$
is given by:



$$s_{ij}^{k+1,0}(\Gamma) = \sum_{g:\, g \sim j \text{ in } \Gamma} s_{ig}^{k,0}(\Gamma \setminus j)$$

for $k \ge 1$, where $s_{ig}^{k,0}(\Gamma \setminus j)$ is the number of SAWs from $i$
to a neighbour $g$ of $j$ (in $\Gamma$) of length $k$ in the graph that is
obtained from $\Gamma$ by removing the vertex $j$. The adjacency
matrix of the graph $\Gamma \setminus j$ is obtained from the adjacency
matrix of $\Gamma$ by deleting the $j$-th row and column.
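The recursion above translates directly into a recursive function in which "removing the vertex $j$" is tracked by a set of blocked vertices rather than by rebuilding the adjacency matrix. This is an illustrative sketch with exponential running time, not an optimised implementation; the name `saw_count` and the adjacency-list dictionary `adj` are assumptions.

```python
def saw_count(adj, i, j, k, removed=frozenset()):
    """Number of self-avoiding walks of length k from i to j,
    avoiding every vertex in `removed`.

    adj: dict mapping each vertex to a list of its neighbours (assumed format).
    """
    if k == 0:
        # The only walk of length 0 stays at its starting vertex.
        return 1 if i == j else 0
    if i == j:
        # A self-avoiding walk of positive length cannot return to its start.
        return 0
    # Delete j from the graph, then sum over its remaining neighbours g.
    blocked = removed | {j}
    return sum(saw_count(adj, i, g, k - 1, blocked)
               for g in adj[j] if g not in blocked)
```

On the complete graph with four vertices, for instance, this gives one SAW of length 1, two of length 2 and two of length 3 between any pair of distinct vertices, and none of length 4.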

Noting that there cannot exist a SAW of length $k >
|V| - 1$, we obtain for the overall expected number of infected
vertices starting from vertex $i$ (with $\lambda = 0$):

$$I_i^{\mathrm{SIR}} = \sum_{k=1}^{|V|-1} \sum_{j \in V} \Bigl(1 - \bigl(1 - \beta^{k}\bigr)^{s_{ij}^{k,0}}\Bigr) \Bigl(1 - \sum_{t=1}^{k-1} q(i,j,t)\Bigr). \tag{11}$$

We write $I_i^{\mathrm{SIR}}(L)$ to represent estimates with the sum
over path lengths $k$ limited to maximum path length $L$.
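Putting the pieces together, the truncated estimate $I_i^{\mathrm{SIR}}(L)$ from (11) can be sketched as follows. This is a minimal illustration, not the Java implementation used for the results reported below: SAWs are counted by a plain depth-first search from the source rather than by the recursion above, and the names `saw_counts`, `sir_impact` and the adjacency-list dictionary `adj` (which is assumed to contain every vertex as a key) are hypothetical.

```python
from collections import defaultdict

def saw_counts(adj, source, max_len):
    """Count SAWs from `source`: returns {(j, k): number of SAWs of length k ending at j}."""
    counts = defaultdict(int)

    def dfs(node, visited, length):
        if length == max_len:
            return
        for nxt in adj[node]:
            if nxt not in visited:
                counts[(nxt, length + 1)] += 1
                visited.add(nxt)
                dfs(nxt, visited, length + 1)
                visited.remove(nxt)

    dfs(source, {source}, 0)
    return counts

def sir_impact(adj, source, beta, max_len):
    """Estimate I_i^SIR(L) from (11), with L = max_len and lambda = 0."""
    counts = saw_counts(adj, source, max_len)
    total = 0.0
    for j in adj:
        if j == source:
            continue  # no SAW of positive length returns to the source
        q_sus = 1.0   # 1 - sum_{t<k} q(i, j, t), updated as k grows
        for k in range(1, max_len + 1):
            q_inf = 1.0 - (1.0 - beta ** k) ** counts.get((j, k), 0)
            q = q_inf * q_sus
            total += q
            q_sus -= q
    return total
```

On a triangle with `beta = 0.5`, `sir_impact(adj, 0, 0.5, 1)` returns `1.0`, which is $\beta d_i$ as stated earlier for walks of length 1; raising `max_len` to 2 adds the contribution of the two SAWs of length 2.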

Simulation results. - We provide a brief application
of our technique to simulations of SIR spreading phenomena
using: a. the social network structure generated from
email interactions between employees of a university [16]
(giant-component with 1133 nodes and 5451 undirected
edges, diameter 8); b. the structure of the C. elegans neural
network [17, 18] (297 nodes and 2345 directed edges)
to demonstrate a directed network, and c. the collaboration
network of the arXiv cond-mat repository [19] (giant-component
with 27519 nodes, 116181 undirected edges,
diameter 16) to demonstrate a larger network.

For each network, we compute estimates $I_i^{\mathrm{SIR}}(L)$ from
(11) for maximum (self-avoiding) walk lengths $L = 1$ to 7
(max. of 5 for the cond-mat network), with variable infection
rate $\beta$, for each patient zero $i$. To investigate the
accuracy of these estimates, we also compute numbers of
infections for each patient zero $i$ and $\beta$ as averages $S_i^{\mathrm{SIR}}$
over 1000 simulations (10000 for the cond-mat network).
Furthermore, to compare the accuracy of inferences of the
relative effectiveness of each node, we have also measured
the k-shell [8] and eigenvector centrality for each node in
the network (these measures were suggested as useful inferrers
of relative spreading efficiency from each node in
[8] and [9]). Using Java code on a 2.0 GHz Intel Xeon
CPU E5-2650, the 10000 simulations $S_i^{\mathrm{SIR}}$ for all nodes
and $\beta$ values were completed for the cond-mat network
in around 2000 hours; our estimates $I_i^{\mathrm{SIR}}(L)$ were completed
for $L = 4$ and 5 in 30 and 60 minutes respectively;
while with Matlab scripts the degree and eigenvector centrality
were completed in less than one second each and
k-shell computed in less than 30 seconds. We note that
computation of the relevant walks $s_{ij}^{k}$ for SIS models is
significantly more efficient than for SIR, since they can
be directly computed from $A^k$ (as described earlier). We
chose to perform SIR simulations here in order to provide
a greater computational challenge for our technique.

The extent to which our estimates accurately represent
the relative spreading effectiveness from each patient zero
is examined via the correlation of estimates $I_i^{\mathrm{SIR}}(L)$ to
simulated results $S_i^{\mathrm{SIR}}$ for the various networks in Fig. 1,
as well as via their rank order correlations (defined in [9]).
These figures demonstrate that our estimates $I_i^{\mathrm{SIR}}(L)$
consistently provide very accurate assessment of relative
spreading effectiveness of the nodes over large ranges of $\beta$
for all networks examined, in particular for $L > 1$ and for
$\beta$ values in the sub-critical, critical, and the lower-end of
the super-critical spreading regimes. (Critical spreading is
defined as $\beta = \beta_c = 1/a_{\max}$ [9], where $a_{\max}$ is the largest
eigenvalue of the adjacency matrix $A$. Fig. 1 indicates $\beta_c$
and also $\beta_{-5\mathrm{dB}}$ where 30 % of the network is infected on
average (upper super-critical regime); the number of infections
continues to increase very quickly beyond this $\beta$.)
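For readers who want to locate the critical infection rate on their own network, the definition $\beta_c = 1/a_{\max}$ can be evaluated with a few lines of power iteration. This is a dependency-free sketch under stated assumptions (a connected, undirected graph with at least one edge, given as a dense adjacency matrix); the function name `critical_beta` is hypothetical, and in practice a linear-algebra library's eigenvalue routine would normally be used instead. The iteration is applied to $A + I$ rather than $A$ so that it also converges on bipartite graphs.

```python
def critical_beta(adj_matrix, iterations=500):
    """Approximate beta_c = 1 / a_max by power iteration on (A + I).

    Assumes a connected undirected graph with at least one edge.
    """
    n = len(adj_matrix)
    x = [1.0] * n
    shifted_eig = 1.0
    for _ in range(iterations):
        # y = (A + I) x; the shift by I leaves the eigenvectors unchanged
        # and raises every eigenvalue by exactly 1.
        y = [x[r] + sum(adj_matrix[r][c] * x[c] for c in range(n))
             for r in range(n)]
        shifted_eig = max(abs(v) for v in y)
        x = [v / shifted_eig for v in y]
    return 1.0 / (shifted_eig - 1.0)
```

For the triangle graph the largest adjacency eigenvalue is 2, so the sketch returns 0.5; for the three-vertex path it converges to $1/\sqrt{2}$.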

The correlation results for $I_i^{\mathrm{SIR}}(L)$ generally improve as
$L$ increases. Estimates up to only short path lengths $L$ do
not properly capture the effects of spreading on the network
structure further away from patient zero when these
nodes become more vulnerable at larger $\beta$. In particular,
$L = 1$ captures only the out-degree of the initially infected
node, and therefore does not represent any network structure
more than one hop away. In general then, one faces
a trade-off between accuracy of inference of effectiveness
against shorter computational time. Importantly though,
very good results can be obtained with short path lengths
$L$, with the results from $L = 4$ say being almost indistinguishable
from longer $L$ for most of the range of $\beta$. This is
a crucial point, since the runtime for the computations for
$L \le 4$ is much faster than simulations, and is on the order
of the runtimes for the more simple degree ($L = 1$) and
k-shell inference methods for the small networks. Finally,
we note that the accuracy of the method drops once $\beta$ is
well-inside the super-critical regime (even for large $L$) due
to: i. insufficient path length at high $\beta$, ii. our independent
walks assumption becomes less valid at high $L$ and
$\beta$, and iii. with most nodes infecting a large proportion of
the network, the structure surrounding each node makes
less of an impact on the spreading efficiency.

Crucially, the accuracy achieved by our $I_i^{\mathrm{SIR}}(L)$ can
only be matched by large numbers of simulations, which
take significantly longer runtime. Fig. 1 shows that, for
the cond-mat network, using only 10 repeat simulations
(with runtime double that of $I_i^{\mathrm{SIR}}(L = 5)$) provides much
worse correlations, while comparable correlations up to the
lower super-critical regime can only be achieved with 1000
simulations which cost around 200 times more runtime. As
deduced earlier, our technique has the same asymptotic scaling
as simulation but is faster by a significant constant factor.

Further, our estimates $I_i^{\mathrm{SIR}}(L)$ are consistently more
accurate with $L \ge 2$ than the k-shell inference, for $\beta$ values
in the sub-critical, critical and early super-critical regimes.
A similar conclusion holds against the eigenvector centrality
measure of [9], for all but a couple of $\beta$ values near
the critical regime in Fig. 1(f); indeed, eigenvector centrality
only achieves comparable accuracy near the critical
regime. These are crucial results covering the regimes
of importance (since in the deeper super-critical regime,
a large proportion of the network becomes infected and
understanding the spreading efficiency of various initially
infected nodes becomes less important). Though our estimates
are less efficient than k-shell or eigenvector centrality,
they are still much faster than simulations, and these
results suggest a strong advantage to using $I_i^{\mathrm{SIR}}(L)$.

Fig. 1: Correlations and rank order correlations of various estimates
of spreading effectiveness to simulated infection numbers
on the structures of the email interaction, C. elegans neural
and cond-mat networks. Panels: (a) Email correlation;
(b) Email rank order correlation; (c) C. elegans correlation;
(d) C. elegans rank order correlation; (e) Cond-mat correlation;
(f) Cond-mat rank order correlation. Results are plotted for
estimates obtained using k-shells (red circles), eigenvector centrality
(purple diamonds), out-degree or $I_i^{\mathrm{SIR}}(L = 1)$ (blue triangles),
our estimation technique $I_i^{\mathrm{SIR}}(L)$ using self-avoiding walks of
length $L = 2$ to 5 for cond-mat and 7 for the other networks
(greyscale x, darker gray-black to indicate longer walk lengths
$L$), and for estimates from smaller numbers of simulations (10,
100 and 1000, which increase in accuracy) for the cond-mat network
only (green +). Vertical (left) green lines indicate critical
spreading at $\beta_c$ and (right) red lines indicate $\beta_{-5\mathrm{dB}}$ with 30 %
of network infected on average.

Additionally, we emphasise that while other tools such
as the k-shell and eigenvector centrality are useful for inferring
the relative spreading effectiveness of each node,
they do not actually estimate the number of infections
from each node. This is a distinct feature of our approach
as compared to these other methods. In Fig. 2 we directly
compare our estimates $I_i^{\mathrm{SIR}}(L)$ to the simulated values
$S_i^{\mathrm{SIR}}$ for each node $i$, for several values of $\beta$. This clearly
shows that our technique provides reasonable estimates of
the expected number of infections across the sub-critical
spreading regime and up to criticality for $L \ge 4$. Indeed,
reasonable accuracy can still be obtained with larger $L$
into the critical regime, though the time-efficiency benefits
of doing so (as compared to simulation) decline.

Fig. 2 also demonstrates quite well the manner in which
estimates improve (in general) with increasing maximum
considered path length $L$. We see that, while using too
small a maximum path length $L$ appears to be the largest
contributor to the inaccuracy of $I_i^{\mathrm{SIR}}(L)$ (serving to pull
points above the line $S_i^{\mathrm{SIR}} = I_i^{\mathrm{SIR}}(L)$), other errors are
introduced by our approximations in (3) and (8) (the former
of which pulls the points below this line by overestimating
the probability of infection). As previously stated though,
for reasonable $\beta$ and large enough $L$, these errors seem to
roughly cancel. Importantly also, while the correlation of
estimates to simulated infection numbers may not always
increase with $L$ in the supercritical regime, larger $L$ values
provide consistently more accurate estimates of infection
numbers (e.g. see $\beta = 0.12$ in Fig. 1 and Fig. 2).

Fig. 2: Simulated $S_i^{\mathrm{SIR}}$ versus estimated $I_i^{\mathrm{SIR}}(L)$ infection
numbers, for each patient zero and maximum path lengths $L$ up
to 7, using the email interaction network structure. Panels shown
include (b) $\beta = 0.040$, (c) $\beta = 0.050$ and (d) $\beta = 0.120$. The straight
green line represents the ideal plot $S_i^{\mathrm{SIR}} = I_i^{\mathrm{SIR}}(L)$.

Finally, we consider a simple heuristic to determine an
appropriate $L$ length to use. For potential infection walks
from patient zero of length $L$, and for small $\beta$ in the sub-critical
regime (in particular with $d\beta < 1$), one can make
a naive estimate of the expected numbers of infections at
length $L$ as $(d\beta)^L$, where $d$ is the average out-degree of
the network. One can then compute minimum $L$ values
to keep $(d\beta)^L$ below a given value $r$. For instance, in the
email network (with $d = 9.62$), using $r = 0.02$ suggests
that with $L \ge 4$ and $\beta \le 0.039$ we will only neglect infections
on walk lengths where the expected number of infections
was below 0.02. Of course, this is a simple estimate,
neglecting the effects of dependent walks and making an
implicit assumption that this $r = 0.02$ is not large enough
to significantly alter the number of infections; however it,
along with observing from the diameters that many walks
can be captured with $L = 4$, helps to explain why $L = 4$
provides good results even as $\beta$ approaches criticality.
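The heuristic amounts to finding the smallest $L$ with $(d\beta)^L < r$, which can be sketched as below. The function name `minimum_walk_length` and its argument names are illustrative assumptions; the sketch applies only in the regime $d\beta < 1$ that the heuristic itself requires.

```python
def minimum_walk_length(avg_degree, beta, r):
    """Smallest L such that (avg_degree * beta) ** L falls below r."""
    growth = avg_degree * beta
    if not 0.0 < growth < 1.0:
        raise ValueError("heuristic requires 0 < avg_degree * beta < 1")
    if r <= 0.0:
        raise ValueError("threshold r must be positive")
    length = 1
    estimate = growth
    while estimate >= r:
        length += 1
        estimate *= growth
    return length
```

With the email network values quoted above (`avg_degree = 9.62`, `beta = 0.039`, `r = 0.02`) the product is about 0.375, whose fourth power is just under 0.02, so the sketch returns 4.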

Conclusions. - We have presented a method for efficiently
estimating the number of resulting infections from
a given initially infected node in a network model. This
technique focusses on counting the number of functional
walks to each candidate for infection, and in SIR models
the only walks of interest are self-avoiding
walks. Our technique is distinct from other recently proposed
measures to infer spreading effectiveness of each
node because it focusses specifically on disease spreading
walks rather than general measures of network topology.
We demonstrated our technique to provide consistently
more accurate inference of spreading effectiveness than
other candidate techniques such as k-shells up to the lower
super-critical regime. This accuracy improvement is obtainable
in reasonable computational time for SIR models,
while still more accurate assessment can be obtained by
increasing the computational time, and faster assessment
can be made for SIS models. Our technique is also distinguished
in specifically estimating the number of infections,
and we demonstrated that these estimates are reasonably
accurate over a large range of $\beta$ for large enough values of
the maximum counted walk lengths $L$.